Robert Krüger, Institute for Visualization and Interactive Systems, University of Stuttgart, kruegert@vis.uni-stuttgart.de (PRIMARY)
Harald Bosch, Institute for Visualization and Interactive Systems, University of Stuttgart, boschhd@vis.uni-stuttgart.de
Steffen Koch, Institute for Visualization and Interactive Systems, University of Stuttgart, kochsn@vis.uni-stuttgart.de
Christoph Müller, Visualization Research Center, University of Stuttgart, mueller@visus.uni-stuttgart.de
Guido Reina, Visualization Research Center, University of Stuttgart, reina@visus.uni-stuttgart.de
Dennis Thom, Institute for Visualization and Interactive Systems, University of Stuttgart, thomds@vis.uni-stuttgart.de
Thomas Ertl, Institute for Visualization and Interactive Systems, University of Stuttgart, ertl@vis.uni-stuttgart.de

Student Team: NO
Tool(s): custom, developed by the Institute for Visualization and Interactive Systems and the Visualization Research Center of the University of Stuttgart.
Video:
Answers to Mini-Challenge 1 Questions:
MC 1.1 Create a visualization of the health and policy status of the entire Bank of Money enterprise as of 2 pm BMT (BankWorld Mean Time) on February 2. What areas of concern do you observe?
In our map view, which shows the locations and mean status values of each facility, we observe that the majority of machines in most regions are in the “healthy” state and only some suffer from a moderate policy deviation. However, it is also obvious that in region-5 and region-10 not a single machine is in a healthy state. Using temporal and spatial filtering on the map and histogram views, one can see that at 14:00 BMT not a single machine sends state reports in 16 facilities of the Atta region (region-25), comprising 1,571 machines, although this is within business hours:
These facilities are possibly completely offline. Datacenter-5 in region-10 also shows a disproportionately high number of machines not logging their states: only 2,090 of 49,000 machines report any data there.
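The filtering step described above can also be reproduced outside the interactive views. The following sketch is a minimal Python/pandas version of it; the file name status_reports.csv and the column names (timestamp, region, facility, ip) are assumptions for illustration and may differ from the real log schema. It counts, per facility, how many distinct machines report in the 15-minute slot starting at 14:00 BMT and lists the facilities with no reports at all:

```python
import pandas as pd

# Hypothetical log layout: one status report per row with the columns
# timestamp (BMT), region, facility, ip -- adjust to the real schema.
logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])

slot = logs[(logs["timestamp"] >= "2012-02-02 14:00") &
            (logs["timestamp"] <  "2012-02-02 14:15")]

# Distinct machines reporting per facility in this time slot.
reporting = (slot.groupby(["region", "facility"])["ip"]
                 .nunique()
                 .rename("reporting_machines"))

# Total known machine population per facility, taken from the whole data set.
population = (logs.groupby(["region", "facility"])["ip"]
                  .nunique()
                  .rename("known_machines"))

summary = pd.concat([population, reporting], axis=1).fillna(0)
silent = summary[summary["reporting_machines"] == 0]
print(silent.sort_values("known_machines", ascending=False))
```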
BoM employees do not seem to adhere to the BoM policy of turning off their machines overnight whenever possible. After filtering for workstation machines, the time slider reveals in the map view that many machines still send state reports during night time; we analyze this behavior further in the matrix view:
Map view displaying many machines logging at night; the matrix view (right) shows shut-down machines in black.
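The degree of compliance with the shut-down policy can be quantified along the following lines. This is a sketch under the same assumed schema; the machineclass value "workstation" and the night window of 22:00-06:00 (in BMT, ignoring per-region local time) are simplifications of our own:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])

# Assumed columns: machineclass, ip, timestamp.
workstations = logs[logs["machineclass"] == "workstation"]
hours = workstations["timestamp"].dt.hour
night_reports = workstations[(hours >= 22) | (hours < 6)]

total = workstations["ip"].nunique()
active_at_night = night_reports["ip"].nunique()
print(f"{active_at_night} of {total} workstations "
      f"({active_at_night / total:.0%}) report during the night")
```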
Using
the swim lanes overview, one can observe that only a relatively small number of
machines exhibit critical policy deviations while a single outlier, a compute
server (172.2.194.20) in datacenter-2, already reports the
“infected” state:
The matrix view in the image shows the machine's state history and demonstrates that the infection was reported at 03:30 local time without any suspicious activity being logged beforehand; starting at 23:45 local time, however, the machine's policy state quickly rose from 2 to 4.
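This micro-level inspection can also be scripted. The sketch below assumes the same hypothetical schema, with a numeric policystatus column in which 5 corresponds to “infected”; it extracts the history of the conspicuous compute server, prints the first occurrence of each policy state, and checks whether any state was skipped:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])

machine = (logs[logs["ip"] == "172.2.194.20"]
              .sort_values("timestamp"))

# First time each policy state (1 = healthy ... 5 = infected) is reported.
first_seen = machine.groupby("policystatus")["timestamp"].min()
print(first_seen)

# Check whether the degradation skipped any state.
states = machine["policystatus"].tolist()
jumps = [(a, b) for a, b in zip(states, states[1:]) if b - a > 1]
print("skipped states:", jumps if jumps else "none")
```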
MC 1.2 Use your visualization tools to look at how the network’s status changes over time. Highlight up to five potential anomalies in the network and provide a visualization of each. When did each anomaly begin and end? What might be an explanation of each anomaly?
Starting at 12:15 BMT (09:15 local time) on the first day, we see whole facilities in Atta going completely offline. Examining this behavior with the time slider, we see the first two facilities “disappear” in the south of Atta, and subsequently the front of “disconnected/offline” facilities moves north. Only 16 facilities in the very northwest of the region are still online at 18:15 BMT:
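The timing of this outage front can be approximated directly from the reports. A rough sketch under the assumed schema (the region value “region-25” and the 18:15 BMT cutoff follow the observations above; facility coordinates would be needed to plot the actual south-to-north propagation):

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
atta = logs[logs["region"] == "region-25"]

cutoff = pd.Timestamp("2012-02-02 18:15")

# Last report before the cutoff and first report after it, per facility,
# approximate the begin and end of each facility's outage.
last_before = (atta[atta["timestamp"] <= cutoff]
                  .groupby("facility")["timestamp"].max()
                  .rename("offline_since"))
first_after = (atta[atta["timestamp"] > cutoff]
                  .groupby("facility")["timestamp"].min()
                  .rename("back_online"))

outage = pd.concat([last_before, first_after], axis=1)
print(outage.sort_values("offline_since"))
```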
The first facilities start coming back at 23:30 local time, from southwest to northeast. In a microanalysis using the matrix view, we can see that some of the machines in those facilities do not come back before 04:30 BMT. We assume that this anomaly is caused by external factors such as a power outage or a network failure rather than by planned maintenance, for several reasons:
This supports the hypothesis of having detected an external anomaly. Although protecting infrastructure against such events is expensive, BoM should ensure that at least the critical backbone infrastructure is protected by a sufficiently dimensioned UPS and that redundant network paths exist, as losing the connection to a whole business unit is hardly acceptable.
All staff of Bank of Money are encouraged to turn off their workstations at night. However, this rule is followed for only ~60% of all workstations. In order to discover relevant transitions between single states (e.g. policy) and combined states (e.g. policy/connection) of a large number of machines, we employ an aggregated state graph visualization. To examine the employees' workstation behavior, we select all loan, teller and office machines and observe their combined connection/policy state transitions.
Left: state transitions of office machines; right: of teller machines. Numbers indicate connection quantile/policy combinations.
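Conceptually, the aggregated state graph counts, over all selected machines, how often one combined connection/policy state is followed by another. A minimal sketch of that aggregation, with the column names, the machineclass values, and the on-the-fly connection quartiles all being illustrative assumptions:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
sel = logs[logs["machineclass"].isin(["loan", "teller", "office"])].copy()

# Combined state: connection quartile (0-3) plus policy status (1-5).
sel["conn_q"] = pd.qcut(sel["numconnections"], 4, labels=False,
                        duplicates="drop")
sel["state"] = sel["conn_q"].astype(str) + "/" + sel["policystatus"].astype(str)

sel = sel.sort_values(["ip", "timestamp"])
sel["next_state"] = sel.groupby("ip")["state"].shift(-1)

# Edge weights of the aggregated state graph.
transitions = (sel.dropna(subset=["next_state"])
                  .groupby(["state", "next_state"])
                  .size()
                  .sort_values(ascending=False))
print(transitions.head(20))
```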
At this point we see an anomaly in connection numbers exhibited only by teller machines in certain policy states. To investigate this phenomenon further, we select these machines and analyze their behavior over time using a parallel coordinates visualization, which shows 29 teller machines with a disproportionately high number of up to 100 connections during the first night, between 02:15 and 05:00 local time. By highlighting these machines in that view, we can observe that they are all associated with the same business unit, and looking at the map, we see them distributed over region-10.
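These outliers can also be located programmatically by searching for teller machines whose connection count during the night window far exceeds the norm. In the sketch below, the threshold of 50 connections, the date used for the “first night”, and the businessunit column are assumptions for illustration; the window is expressed in BMT rather than local time for simplicity:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
tellers = logs[logs["machineclass"] == "teller"]

# Night window of the first day, 02:15-05:00 (date assumed).
night = tellers[(tellers["timestamp"] >= "2012-02-02 02:15") &
                (tellers["timestamp"] <= "2012-02-02 05:00")]

peak = night.groupby("ip")["numconnections"].max()
suspicious = peak[peak > 50]          # illustrative threshold
print(len(suspicious), "teller machines with unusually many connections")
print(logs[logs["ip"].isin(suspicious.index)]
         [["ip", "region", "businessunit"]]
         .drop_duplicates())
```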
In the following night, in the very same time frame, already 893 of 2,548 machines, including the 29 from the first night, show the same behavior. Inspecting the logged activities does not reveal any specific event that could explain this. We suggest letting region-10 staff investigate whether the increase in connections was caused by BoM infrastructure or not. In the latter case, the machines' suspicious, parallel behavior might be caused by a botnet. Even worse, this would not have been detected as a policy deviation, which might indicate the need to update the virus scanners and/or the policy definition.
Our line chart view indicates that a serious degradation of health states grows exponentially over the BoM network, which we interpret as a spreading virus. Moreover, our state transition graph shows that not a single machine improves its policy state over time and that the degradation happens gradually, without skipping states.
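Both observations can be cross-checked on the raw reports: the cumulative number of infected machines should grow roughly exponentially, and no machine's policy sequence should ever decrease. A sketch under the same assumed schema:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])

# Cumulative number of infected machines per hour (policystatus 5 assumed
# to encode "infected").
infected = logs[logs["policystatus"] == 5]
first_infection = infected.groupby("ip")["timestamp"].min()
cumulative = (first_infection.dt.floor("1h")
                             .value_counts()
                             .sort_index()
                             .cumsum())
print(cumulative)

# Does any machine ever improve its policy state?
logs = logs.sort_values(["ip", "timestamp"])
diffs = logs.groupby("ip")["policystatus"].diff()
improving = logs.loc[diffs < 0, "ip"].unique()
print("machines that ever improve:", len(improving))
```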
Observations show that the current maintenance plan does not seem to be effective: although we can see regular maintenance activities in the machines' state histories, these activities appear to have neither a slowing nor a stopping effect on the degradation of policy states. Our set-based selection management tool reveals that of the 6,379 machines reporting the “infected” status, 1,075 have logged a maintenance activity, and these maintenance activities are evenly distributed over all policy states.
Set-based selection management that helps (in)validate the analyst's hypothesis. Note that none of the machines that receive maintenance in policy state five ever improve in policy afterwards (top row).
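The set intersections behind this figure can be reproduced with a few groupings. The sketch below assumes an activity column whose value "maintenance" marks maintenance events (the real encoding of activity flags may differ) and additionally checks whether any machine maintained in policy state 5 ever improves afterwards:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
logs = logs.sort_values(["ip", "timestamp"])

infected = set(logs.loc[logs["policystatus"] == 5, "ip"])
maint = logs[logs["activity"] == "maintenance"]   # assumed encoding
maintained = set(maint["ip"])

print(len(infected), "infected machines,",
      len(infected & maintained), "of them with maintenance logged")

# Distribution of policy states at maintenance time.
print(maint["policystatus"].value_counts().sort_index())

# Machines maintained while in state 5: do they ever report a lower state?
recovered = []
for ip, t in maint.loc[maint["policystatus"] == 5, ["ip", "timestamp"]].values:
    later = logs[(logs["ip"] == ip) & (logs["timestamp"] > t)]
    if (later["policystatus"] < 5).any():
        recovered.append(ip)
print("machines recovering after maintenance in state 5:",
      len(set(recovered)))
```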
This distribution is the same for the ~336,000 machines that stay healthy, i.e. problematic machines are not handled with higher priority. Only a few machines reach policy state 5 without reporting critical policy deviations before, and these are all offline immediately before being infected. These transitions, however, make clear that the maintenance strategy at BoM is in dire need of improvement: currently, deteriorated machines cannot be recovered at all.
We could not detect a direct relation between attached USB devices and subsequent virus infections. We assume that the early infection of the datacenters might boost the spread of the infection.
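The (absent) correlation between USB attachments and later infections can be checked along the following lines. The activity value "usb attached" and the 24-hour look-ahead window are assumptions chosen for illustration:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])

usb = logs[logs["activity"] == "usb attached"]     # assumed encoding
first_infection = (logs[logs["policystatus"] == 5]
                      .groupby("ip")["timestamp"].min()
                      .rename("infected_at")
                      .reset_index())

# For every USB event, check whether the same machine reports an infection
# within the next 24 hours.
merged = usb.merge(first_infection, on="ip", how="left")
window = pd.Timedelta("24h")
followed = merged[(merged["infected_at"] > merged["timestamp"]) &
                  (merged["infected_at"] <= merged["timestamp"] + window)]

print(f"{len(followed)} of {len(usb)} USB events are followed "
      f"by an infection within 24 hours")
```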
The map shows that region-5 and region-10 already have higher-than-average policy states from the beginning. Investigating the histograms of these regions, it can be seen that not a single machine has reported a healthy status; instead, they all start with moderate deviations. Apart from this difference, the machines do not seem to develop worse than the rest of the BoM network. We still think it is indispensable that BoM react to such large-scale deviations in a timely fashion, but we detect no indication of measures taken in the sampled data.
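This observation corresponds to a simple comparison of the first policy state each machine reports, grouped by region. A sketch, with the column names and the encoding 1 = healthy assumed as before:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
logs = logs.sort_values("timestamp")

# First policy state each machine ever reports, together with its region.
first = logs.groupby("ip").first()[["region", "policystatus"]]

# Share of machines starting in the healthy state, per region.
healthy_share = (first.assign(healthy=first["policystatus"] == 1)
                      .groupby("region")["healthy"].mean())
print(healthy_share.sort_values())
```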
At the beginning of the time series, only three machines in datacenter-5 in region-10 report their state, all of them office machines. The remaining 51,327 machines are offline. At 04:45 local time, 240 servers come online, but only for one hour. The policy line plot shows that, in the following hours, large groups of machines start operating. About 15 percent of them start with a moderate policy deviation, 1% with serious or critical deviations, and one machine already reports a possible virus.
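This staged power-up shows up as a step pattern in the number of distinct reporting machines per time slot. A sketch, with the region/facility naming and the 15-minute slot width again being assumptions:

```python
import pandas as pd

logs = pd.read_csv("status_reports.csv", parse_dates=["timestamp"])
dc5 = logs[(logs["region"] == "region-10") &
           (logs["facility"] == "datacenter-5")]

# Distinct machines reporting per 15-minute slot, split by machine class.
reporting = (dc5.groupby([pd.Grouper(key="timestamp", freq="15min"),
                          "machineclass"])["ip"]
                .nunique()
                .unstack(fill_value=0))
print(reporting.head(48))   # first twelve hours
```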
We hypothesize that datacenter-5 is being put into operation for the first time, and that a smaller number of servers is therefore tested before the vast majority is powered up.